Skip to content

Conversation

iajoiner
Copy link
Contributor

@iajoiner iajoiner commented Nov 21, 2021

  • closes ENH: to_orc #43864
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

@pep8speaks
Copy link

pep8speaks commented Nov 21, 2021

Hello @iajoiner! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-06-13 21:37:24 UTC

@iajoiner iajoiner marked this pull request as draft November 21, 2021 08:29
@iajoiner
Copy link
Contributor Author

iajoiner commented Nov 21, 2021

Thanks @NickFillot for your effort in integrating my ORC writer into Pandas. Now since your PR has been closed I will take care of modifying it and getting it approved from now on.

Here is the link to Nick's PR which I unfortunately can not reopen: #43860

@iajoiner iajoiner marked this pull request as ready for review November 21, 2021 09:29
@iajoiner iajoiner changed the title Feature to orc [EHN] pandas.DataFrame.to_orc Nov 21, 2021
@iajoiner iajoiner mentioned this pull request Nov 21, 2021
@lithomas1 lithomas1 added Arrow pyarrow functionality Enhancement IO Data IO issues that don't fit into a more specific label labels Nov 23, 2021
@iajoiner
Copy link
Contributor Author

Since apache/arrow#9702 will significantly add to the ORC writer API shall we delay the merge until late Jan when Arrow 7.0.0 is released?

@twoertwein
Copy link
Member

twoertwein commented Nov 25, 2021

Since apache/arrow#9702 will significantly add to the ORC writer API shall we delay the merge until late Jan when Arrow 7.0.0 is released?

I can't judge this. Probably depends at least on 1) whether it would create inconsistencies (early implementation and then with the new pyarrow later) and on 2) how soon you want it (pandas 1.4 is scheduled for New Year's Eve, 1.5/2.0 will then probably be a year after that).

@iajoiner
Copy link
Contributor Author

@twoertwein I don't think there will be inconsistencies. I'm really just adding optional writer arguments that can be **kwargs in Pandas. Yes I want it out there ASAP.

@iajoiner
Copy link
Contributor Author

iajoiner commented Dec 3, 2021

@github-actions pre-commit

@github-actions
Copy link
Contributor

github-actions bot commented Jan 5, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 5, 2022
@iajoiner
Copy link
Contributor Author

iajoiner commented Jan 5, 2022

I’m still around. Will update this PR this month. It’s just that I have a few Arrow tickets to deal with now.

@jreback
Copy link
Contributor

jreback commented Jan 16, 2022

love to have this, if you can merge master and address comments we can look again

@jreback jreback removed the Stale label Jan 16, 2022
a bytes object is returned.
engine : {{'pyarrow'}}, default 'pyarrow'
ORC library to use, or library it self, checked with 'pyarrow' name
and version >= 7.0.0. Raises ValueError if it is anything but
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO Raises ValueError if it is anything but... is redundant with the raises section below so I think this can be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! I will fix that tonight.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Co-authored-by: Matthew Roeschke <[email protected]>
# If unsupported dtypes are found raise NotImplementedError
for dtype in df.dtypes:
dtype_str = dtype.__str__().lower()
if (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will pyarrow raise if these dtypes are passed? If so, can a a pyarrow error be caught and reraised as a NotImplementedError so this can be more flexible to other potential dtypes not supported in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to test these types individually. Not sure right now.

Copy link
Contributor Author

@iajoiner iajoiner Jun 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke It seg faults out for all instances but sparse. I need to catch them in Arrow 9.0.0. Meanwhile can we use the current dtype filter?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this is fine then given:

  1. Could you use the type checking functions in pandas.core.dtypes.common instead? e.g. is_categorical_dtype(dtype)?
  2. Could you make a note that in pyarrow 9.0.0 this checking should not be needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Copy link
Contributor Author

@iajoiner iajoiner Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Since for sparse dtypes we get a TypeError from Arrow when converting the dataframe to a pyarrow table I plan to use TypeError for the other 4 in pyarrow 9.0.0 as well. The try-except block has been added in addition to the type checks for the 4 that segfault out right now with the note.

}
expected = pd.DataFrame.from_dict(data)

outputfile = os.path.join(dirpath, "TestOrcFile.testReadWrite.orc")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use tm.ensure_clean("TestOrcFile.testReadWrite.orc") as a context manager

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@iajoiner
Copy link
Contributor Author

iajoiner commented Jun 7, 2022

@mroeschke Really thanks for reviewing this PR! I have one question about What’s New. Does the ORC writer qualify as a major enhancement? Personally I think it does and is the 4th most important major enhancement in 1.5.0.

def test_orc_writer_dtypes_not_supported(df_not_supported):
# GH44554
# PyArrow gained ORC write support with the current argument order
msg = """The dtype of one or more columns is unsigned integers,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Single quotes please

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this wasn't changed still.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Now it has been changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke Ah that’s because black replaces single quotes with double ones automatically.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's also a pandas/tests/io/data/orc/TestOrcFile.testReadWrite.orc file that was added that can just be created/destroyed in the test.

@mroeschke
Copy link
Member

@mroeschke Really thanks for reviewing this PR! I have one question about What’s New. Does the ORC writer qualify as a major enhancement? Personally I think it does and is the 4th most important major enhancement in 1.5.0.

Sure! I think the whatsnew entry can be a small section instead of a one line mention

@iajoiner
Copy link
Contributor Author

iajoiner commented Jun 12, 2022

@mroeschke The only red is likely transient and completely unrelated to the PR. What's new has been updated.

The only issue you pointed out which I haven't fixed is the PyArrow error issue since unsupport dtypes often cause segfaults in pyarrow. I have filed an Arrow ticket to fix it here: https://issues.apache.org/jira/browse/ARROW-16817 . Meanwhile due to the fact that Pandas 1.5.0 will be released before Arrow 9.0.0 in order not to have segfault when users do try to use unsupported dtypes can we use a filter in Pandas right now? We can then change it once the Arrow issue is fixed.

the RangeIndex will be stored as a range in the metadata so it
doesn't require much space and is faster. Other indexes will
be included as columns in the file output.
**kwargs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you name this engine_kwargs and have this take a Dict[str, Any] instead? It's the pattern we've been using in other methods.

Also is there there documentation you can link from pyarrow on what other engine keyword arguments can be accepted?

Copy link
Contributor Author

@iajoiner iajoiner Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke You mean just like the excel methods but without having to support the legacy **kwargs approach? Sure!

I've followed to_feather convention and added :func:pyarrow.orc.write_table which should link to the correct documentation.

(e.g. via builtin open function). If path is None,
a bytes object is returned.
engine : str, default 'pyarrow'
ORC library to use, or library it self, checked with 'pyarrow' name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ORC library to use, or library it self, checked with 'pyarrow' name
ORC library to use. Pyarrow must be >= 7.0.0.

Library it self is a little unclear to me here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Fixed!

@mroeschke
Copy link
Member

@mroeschke The only red is likely transient and completely unrelated to the PR. What's new has been updated.

The only issue you pointed out which I haven't fixed is the PyArrow error issue since unsupport dtypes often cause segfaults in pyarrow. I have filed an Arrow ticket to fix it here: https://issues.apache.org/jira/browse/ARROW-16817 . Meanwhile due to the fact that Pandas 1.5.0 will be released before Arrow 9.0.0 in order not to have segfault when users do try to use unsupported dtypes can we use a filter in Pandas right now? We can then change it once the Arrow issue is fixed.

Sure sounds good. Can use manual dtype filtering for now with some corrections in my review.

Could you also uncommit pandas/tests/io/data/orc/TestOrcFile.testReadWrite.orc?

@iajoiner
Copy link
Contributor Author

iajoiner commented Jun 13, 2022

@mroeschke Please review again. All issues have been taken care of with the exception of using single quotes for msg which is not possible due to black.

P.S. Red stuff is unrelated to the PR.

pandas/io/orc.py Outdated
*,
engine: Literal["pyarrow"] = "pyarrow",
index: bool | None = None,
engine_kwargs: dict[str, Any] = {},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry just realized. Could you default this here engine_kwargs: dict[str, Any] | None = None, and if None convert to empty dict in the function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sticking with it! One more minor comment then LGTM

@iajoiner
Copy link
Contributor Author

iajoiner commented Jun 13, 2022

@mroeschke Fixed haha. Really thanks for reviewing! Is the next step merging once it is clear that no error related to the PR exists? :)

@mroeschke mroeschke merged commit 15902bd into pandas-dev:main Jun 14, 2022
@mroeschke
Copy link
Member

Awesome thanks for the responsiveness down stretch!

@iajoiner iajoiner deleted the feature-to-orc branch June 14, 2022 00:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Arrow pyarrow functionality Enhancement IO Data IO issues that don't fit into a more specific label

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ENH: to_orc

8 participants